Expand projection.md with memory projection and performance details #577
Closed
araina-amd wants to merge 10 commits into release/v26.2 from
…32B Configs for MI300X & MI355X (#556)

YF: Only SFT related config and Doc changes, bypassing unit CI tests

## Summary

This PR introduces post-training documentation and updates Qwen3 32B model configuration files to support AMD MI300X and MI355X accelerators.

---

## Changes

### 📘 Documentation

- **Added `posttraining.md`**
  - New comprehensive guide for post-training workflows
  - Covers setup instructions, configuration details, and usage examples
- **Updated `docs/README.md`**
  - Added a new section referencing post-training documentation
  - Improved documentation organization and navigation

---

### ⚙️ Configuration Updates

- **Updated Qwen3_32B model YAML configs**
  - Added/modified configurations optimized for:
    - MI300X
    - MI355X
  - Adjusted parameters for compatibility and stable execution

---

## Validation

- Verified updated configs load and execute successfully on MI300X and MI355X environments
- Confirmed documentation links and structure render correctly

---

## Checklist

- [x] Added `posttraining.md`
- [x] Updated `docs/README.md`
- [x] Modified Qwen3_32B YAML configs
- [x] Verified changes locally
Adds a patch to fix Megatron FSDP compatibility with PyTorch 2.10+. The patch updates `get_mesh_names` to use the new DeviceMesh API (`_get_root_mesh()` and `_flatten_mapping`) instead of the deprecated `_mesh_resources.child_to_root_mapping` removed in PyTorch 2.10. The patch is automatically applied when `use_megatron_fsdp` is enabled.

Co-authored-by: WangLingxun <linxwang@amd.com>
Adds support for CPU initialization in Primus Turbo linear layers (`RowParallelLinear`, `ColumnParallelLinear`, and `LayerNormLinear`). When `use_cpu_initialization` is enabled, the patch disables custom init methods by passing a no-op lambda, allowing Megatron's CPU initialization to work correctly with Primus Turbo's custom layer implementations.

Co-authored-by: WangLingxun <linxwang@amd.com>
Previously, the evaluation loss was computed per iteration and overwritten, so the reported value reflected only the last iteration and the average was wrong whenever more than one eval iteration was used. This fix accumulates the numerator and denominator separately across all eval iterations and computes the final average at the end.
Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
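The accumulation described in the eval-loss fix above can be sketched as follows. This is a minimal illustration, not the code from the commit: the function name `average_eval_loss` and the `(loss_sum, token_count)` pair layout are assumptions made for the example.

```python
def average_eval_loss(per_iteration):
    """Average a loss over several eval iterations.

    `per_iteration` is a hypothetical iterable of (loss_sum, token_count)
    pairs, one pair per eval iteration. The numerator and denominator are
    accumulated separately and divided once at the end, instead of
    overwriting a per-iteration average.
    """
    total_loss = 0.0
    total_count = 0.0
    for loss_sum, token_count in per_iteration:
        total_loss += loss_sum
        total_count += token_count
    if total_count == 0:
        # No tokens were evaluated; there is no meaningful average.
        return float("nan")
    return total_loss / total_count


# Two eval iterations with different token counts.
iterations = [(4.0, 1), (6.0, 3)]
print(average_eval_loss(iterations))
```

With these inputs, keeping only the last iteration would report 6.0 / 3 = 2.0, whereas accumulating across both iterations gives (4.0 + 6.0) / (1 + 3) = 2.5.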
…odes (#554)

### Changes:

Only flag imbalance if the COUNT of GPUs on each node differs. Example:

- 4 on Node 0, 4 on Node 1 -> counts=[4,4] -> set={4} -> len=1 -> NOT imbalanced.
- 7 on Node 0, 1 on Node 1 -> counts=[7,1] -> set={7,1} -> len=2 -> Imbalanced.

### Reason for changes:

The previous logic would issue a NUMA imbalance warning if not all GPUs were connected to the same node, resulting in a false positive when using a multi-socket CPU.

---------

Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
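The counts -> set -> len check from the example in this commit message can be written as a short sketch. The function name `is_numa_imbalanced` and the `gpu_to_node` mapping are hypothetical, chosen for illustration; the actual implementation in the repository may differ, and NUMA nodes with no GPUs attached are simply not counted here.

```python
from collections import Counter


def is_numa_imbalanced(gpu_to_node):
    """Return True if NUMA nodes host different numbers of GPUs.

    `gpu_to_node` is a hypothetical mapping of GPU index -> NUMA node id.
    Only nodes that host at least one GPU appear in the counts.
    """
    gpus_per_node = Counter(gpu_to_node.values())
    # Imbalanced only when the per-node GPU counts are not all equal.
    return len(set(gpus_per_node.values())) > 1


# 4 GPUs on node 0 and 4 on node 1: counts=[4,4] -> set={4} -> balanced.
balanced = {gpu: (0 if gpu < 4 else 1) for gpu in range(8)}
# 7 GPUs on node 0 and 1 on node 1: counts=[7,1] -> set={7,1} -> imbalanced.
skewed = {gpu: (0 if gpu < 7 else 1) for gpu in range(8)}

print(is_numa_imbalanced(balanced), is_numa_imbalanced(skewed))
```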
Updates the AINIC Docker build inputs and adjusts the pretrain launcher to disable HipBLASLt tuning by default (to avoid profiler/TE issues), while also extending CI to build an additional v25.09 AINIC image variant.

Changes:

- Disable HipBLASLt tuning by default in `run_pretrain.sh`, requiring an explicit opt-in env var to enable it.
- Bump the AINIC bundle used by the AINIC Docker image from a-38 to a-56.
- Update CI to use the new bundle and add a new -v25.09-ainic image build/push step.
- Introduced a new class, `ElapsedAverageExtension`, to calculate and inject the running average of elapsed time per iteration (ms) into training logs.
- Updated `TrainingLogInfo` to include `elapsed_index` for tracking elapsed time segments.
- Enhanced log parsing to support the new elapsed time metrics.
- Modified `patch_training_log_unified` to integrate the new extension alongside existing memory and throughput statistics.

---------

Co-authored-by: HuangWei-95 <weihuan@amd.com>
Co-authored-by: wenxie-amd <wen.xie@amd.com>
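The running average of elapsed time per iteration mentioned in this commit can be illustrated with a small sketch. The class `RunningAverage` below is hypothetical and stands in only for the averaging step; it is not the `ElapsedAverageExtension` class itself and does not parse or patch training logs.

```python
class RunningAverage:
    """Track the mean of all values seen so far (hypothetical helper)."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value_ms):
        """Add one elapsed-time sample (ms) and return the updated mean."""
        self.total += value_ms
        self.count += 1
        return self.total / self.count


avg = RunningAverage()
for elapsed_ms in (100.0, 200.0, 300.0):
    # Each log line would carry the mean of every iteration up to this one.
    print(f"elapsed: {elapsed_ms:.1f} ms, running avg: {avg.update(elapsed_ms):.1f} ms")
```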
Ensure `--key=value` and `--key value` are parsed consistently so runtime config overrides apply correctly.
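One way to treat the two override forms identically is sketched below. The function `parse_overrides` is a hypothetical example, not the parser used in the repository; treating a bare `--flag` as the string `"true"` and keeping every value as a string are assumptions of this sketch.

```python
def parse_overrides(argv):
    """Parse `--key=value` and `--key value` tokens into one dict.

    Hypothetical sketch: values stay as strings, and a flag with no
    value is recorded as "true".
    """
    overrides = {}
    i = 0
    while i < len(argv):
        token = argv[i]
        if not token.startswith("--"):
            raise ValueError(f"unexpected argument: {token!r}")
        body = token[2:]
        if "=" in body:
            # Form 1: --key=value (split on the first '=' only).
            key, value = body.split("=", 1)
            i += 1
        elif i + 1 < len(argv) and not argv[i + 1].startswith("--"):
            # Form 2: --key value (the value is the next token).
            key, value = body, argv[i + 1]
            i += 2
        else:
            # Bare flag with no value (assumed to mean "true").
            key, value = body, "true"
            i += 1
        overrides[key] = value
    return overrides


# Both spellings produce the same override mapping.
print(parse_overrides(["--lr=0.1", "--steps", "10"]))
print(parse_overrides(["--lr", "0.1", "--steps=10"]))
```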